Incremental Web-Site Boundary Detection Using Random Walks
Abstract
The paper describes variations of the classical k-means clustering algorithm that can be used effectively to address the so-called Web-site Boundary Detection (WBD) problem. The suggested advantages of these techniques are that they can quickly identify most of the pages belonging to a web-site and, in the long run, return a solution whose accuracy is comparable to (if not better than) that of other clustering methods. We analyze our techniques on artificial clones of the web generated using a well-known preferential attachment method.
Keywords: Web Site Boundary Detection, Random Walk Techniques, Web Site Clustering
Similar resources
Web-Site Boundary Detection Using Incremental Random Walk Clustering
In this paper we describe a random walk clustering technique to address the Website Boundary Detection (WBD) problem. The technique is fully described and compared with alternative (breadth and depth first) approaches. The reported evaluation demonstrates that the random walk technique produces comparable or better results than those produced by these alternative techniques, while at the same t...
Random Walks in the Quarter Plane Absorbed at the Boundary: Exact and Asymptotic
Nearest neighbor random walks in the quarter plane that are absorbed when reaching the boundary are studied. The cases of positive and zero drift are considered. Absorption probabilities at a given time and at a given site are made explicit. The following asymptotics for these random walks starting from a given point (n0, m0) are computed: that of probabilities of being absorbed at a given sit...
Number of times a site is visited in two-dimensional random walks.
In this paper, formulas are derived to compute the mean number of times a site has been visited in a random walk on a two-dimensional lattice. Asymmetric random walks are considered, with or without drift, for different boundary conditions. It is shown that in case of absorbing boundaries the mean number of visits reaches stationary values over the lattice; comparisons with a Monte Carlo simula...
Saddlepoint Approximations and Nonlinear Boundary Crossing Probabilities of Markov Random Walks by Hock
Saddlepoint approximations are developed for Markov random walks Sn and are used to evaluate the probability that (j − i)g((Sj − Si)/(j − i)) exceeds a threshold value for certain sets of (i, j). The special case g(x) = x reduces to the usual scan statistic in change-point detection problems, and many generalized likelihood ratio detection schemes are also of this form with suitably chosen g. W...
Web-Site Boundary Detection
Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task. In this paper a web-page clustering approach to boundary detection is suggested. The principal issue is feature selection, hampered by the observation that there is no clear understanding of what a web-site is. This paper proposes a definition of a web-site, founde...